(57) Abstract: The present invention relates to mechanisms for handling and detecting collisions between threads (5, 6, 7) that
execute computer program instructions out of program order. According to an embodiment of the present invention each of a plurality
of threads (5, 6, 7) is associated with a respective data structure (9, 10, 11) comprising a number of bits (12) that correspond to
memory elements (m0, m1, m2, ..., mn) of a shared memory (4). When a thread accesses a memory element in the shared memory it
sets a bit in its associated data structure, which bit corresponds to the accessed memory element. This indicates that the memory
element has been accessed by the thread. Collision detection may be carried out after the thread has finished executing by means of
comparing the data structure of the thread with the data structures of other threads on which the thread may depend.

COLLISION HANDLING APPARATUS AND METHOD
FIELD OF THE INVENTION 

The present invention relates in general to execution of computer program
instructions, and more specifically to thread-based speculative execution of
computer program instructions out of program order.

BACKGROUND OF THE INVENTION 

The performance of computer processors has been tremendously enhanced
over the years. This has been achieved both by means of making operations
faster and by means of increasing the parallelism of the processors, i.e. the
ability to execute several operations in parallel. Operations can for instance be
made faster by means of improving transistors to make them switch faster or
optimizing the design to minimize the level of logic needed to implement a
given function. Techniques for parallelism include processing computer
program instructions concurrently in multiple threads. There are programs that
are designed to execute in several concurrent threads, but a program that is
designed to execute in a single thread can also be executed in several
concurrent threads. If the execution of a program in several concurrent
threads causes program instructions to be executed in an order that differs
from the program order in which the program was designed to execute, the
thread execution is speculative. The discussion hereinafter focuses on such
speculative thread execution.

A computer program that has been designed to be executed in a single thread
can be parallelised by dividing the program flow into multiple threads and
speculatively executing these threads concurrently, usually on multiple
processing units. The international patent application describes
techniques that may be used to divide a program into multiple threads.
However, if the threads access a shared memory, collisions between the
concurrently executed threads may occur. A collision is a situation in which the
threads access the shared memory in such a way that there is no guarantee that
the semantics of the original single-threaded program is preserved.

A collision may occur when two concurrent threads access the same memory
element in the shared memory. An example of a collision is when a first thread
writes to a memory element and the same memory element has already been
read by a second thread which follows the first thread in the program flow of
the single-threaded program. If the write operation performed by the first
thread changes the data in the memory element, the second thread will have read
the wrong data, which may give a result of program execution that differs from
the result that would have been obtained if the program had been executed in
a single thread. Depending on the implementation, collisions can for example
also occur when two threads write to the same memory element in the shared
memory.

Execution of a computer program in multiple concurrent threads is intended
to speed up program execution, without altering the semantics of the program.
It is therefore of interest to provide a mechanism for detecting collisions.
When a collision has been detected one or more threads can be rolled back in
order to make sure that the semantics of the single-threaded program is
preserved. A rollback involves restarting a thread at an earlier point in
execution, and undoing everything that has been done by the thread after that
point. In the example above, in which the older first thread wrote to a memory
element that already had been read by the younger second thread, the second
thread should be rolled back at least to the point when the memory element
was read, if it is to be guaranteed that the semantics of the single-threaded
program is preserved.

A known mechanism for detecting and handling collisions involves keeping
track of accesses to memory elements by means of associating two or more
flag bits per thread with each memory object. One of these flag bits is used to
indicate that the memory object has been read by the thread, and another bit is
used to indicate that the memory object has been modified by the thread.

The international patent application WO 00/70450 describes an example of
such a known mechanism. Before a primary thread writes to a memory
element in a shared memory, status information associated with the memory
element is checked to see if a speculative thread has read the memory element.
If so, the speculative thread is caused to roll back so that the speculative thread
can read the result of the write operation.

A disadvantage of this known mechanism when implemented in software is
that it results in a large execution overhead due to the communication and
synchronization between the threads that is required for each access to the
shared memory. The status information is accessible to several threads and a
locking mechanism is therefore required in order to make sure that errors do
not occur due to concurrent access to the same status information by two
threads. There is also a need for memory barriers (also called memory fences)
in order to ensure correct ordering between accesses to the shared memory
and accesses to the status information.

Another example of a known mechanism for detecting and handling collisions
is described in Steffan J.G. et al., "The Potential for Using Thread-Level Data
Speculation to Facilitate Automatic Parallelization", Proceedings of the Fourth
International Symposium on High-Performance Computer Architecture,
February 1998, and in Oplinger J. et al., "Software and Hardware for
Exploiting Speculative Parallelism with a Multiprocessor", Stanford University
Computer Systems Lab Technical Report CSL-TR-97-715, February 1997. An
extended cache coherency protocol is used to support speculative threads.
The flag bits are, according to this technique, associated with cache lines in a
first level cache of each of a plurality of processors. When a thread performs a
write operation, a standard cache coherency protocol invalidates the affected
cache line in the other processors. By extending the cache coherency protocol
to include the thread number in the invalidation request, the other processors
can detect read-after-write dependence violations and perform rollbacks if
necessary. A disadvantage of this approach is that speculatively accessed cache
lines have to be kept in the first level cache until the speculative thread has
been committed, otherwise the extra information associated with each cache
line is lost. If the processor runs out of available positions in the first level
cache during execution of the speculative thread, the speculative thread has to
be rolled back. Another disadvantage is that the method requires modifications
to the cache coherency protocol implemented in hardware, and cannot be
implemented purely in software using standard microprocessor components.

SUMMARY OF THE INVENTION

As mentioned above the known mechanisms for handling and detecting 
collisions have some disadvantages. The problem solved by the present 
invention is to provide mechanisms that simplify handling and detection of 
collisions. 

A first object of the present invention is to provide a device having simplified
mechanisms for recording information regarding memory accesses to a shared
memory.

A second object of the present invention is to provide a simplified method for
recording information regarding memory accesses to a shared memory.

A third object of the present invention is to provide a simplified method for
handling possible collisions between a plurality of threads.

The objects of the present invention are achieved by means of an apparatus
according to claim 1, by means of a method according to claim 17 and by means
of a method according to claim 27. The objects of the invention are further
achieved by means of computer program products according to claim 36 and
claim 37.

According to the present invention each of a plurality of threads is associated
with a respective data structure for storing information regarding accesses to the
memory elements of the shared memory. When a thread accesses a selected
memory element in the shared memory, information is stored in its associated
data structure, which information is indicative of the access to the selected
memory element. According to an embodiment of the present invention
collision detection is carried out after the thread has finished executing by means
of comparing the data structure of the thread with the data structures of other
threads on which the thread may depend.

An advantage of the present invention is that each thread is associated with a
respective data structure that stores the information indicative of the accesses to
the shared memory. This is especially advantageous in a software
implementation since each thread will only modify the data structure with which
it is associated. The threads will read the data structures of other threads, but
they will only write to their own associated data structure according to the
present invention. The need for locking mechanisms is therefore reduced
compared with the known solutions discussed above, in which the information
indicative of memory accesses was associated with the memory elements of the
shared memory and was modified by all the threads. The reduced need for
locking mechanisms reduces the execution overhead and makes the
implementation simpler. In the software implementation, the absence of locks
and memory barriers during thread execution will also give a compiler more
freedom to optimize the code.

Another advantage of the present invention is that, since it does not require a
modified cache coherency protocol, it can be implemented purely in software,
thus making it possible to implement the invention using standard components.

Further advantages of embodiments of the present invention will be apparent
from the following detailed description of preferred embodiments with
reference to accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a schematic block diagram of a computer system in which the present
invention is used.

Figs. 2A and 2B are schematic diagrams that illustrate a computer program being
executed in a single thread and divided into several threads respectively.

Fig. 3A is a schematic block diagram that illustrates how data structures according
to the present invention are used.

Fig. 3B is a schematic block diagram that illustrates how an alternative
embodiment of data structures according to the present invention is used.

Fig. 4 is a flow diagram illustrating how reading from the shared memory may be
performed according to the present invention.

Fig. 5 is a flow diagram illustrating how writing to the shared memory may be
performed according to the present invention.

Fig. 6 is a schematic block diagram that illustrates dependence lists associated
with threads according to the present invention.

Fig. 7 is a flow diagram illustrating how a thread may be executed and a collision
check for the thread may be made according to the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Figure 1 illustrates a computer system 1 including two central processing units
(CPUs), a first CPU 2 and a second CPU 3. The CPUs access a shared memory 4,
divided into a number of memory elements m0, m1, m2, ..., mn. The memory
elements may for instance be equal to a cache line or may alternatively
correspond to a variable or an object in a source language. Figure 1 also shows
three threads 5, 6, 7 executing on the CPUs 2, 3.

A thread can be seen as a portion of computer program code that is defined by
two checkpoints, a start point and an end point. Figure 2A shows a schematic
illustration of a computer program 8 comprising a number of instructions or
operations i1, i2, ..., in. When the computer program is executed as a single
thread, the normal way of processing the instructions is in the program order,
i.e. from top to bottom in Figure 2A. It is however possible, according to known
techniques as mentioned above, to divide the program into multiple threads.
The program 8 may for instance be divided into the three threads 5, 6, 7 as
indicated in Figure 2A. The threads can be executed concurrently. Figure 2B
illustrates an example of a threaded program flow, where the first CPU 2 first
processes the thread 5 and then the thread 6, and the second CPU 3 starts
processing thread 7 before the threads 5 and 6 have finished executing on the
first CPU 2.

Figure 2B shows an example of how the threads 5, 6, 7 may execute. Many other
alternative ways of executing the threads are however possible. It is for instance
not necessary that the first CPU 2 finishes processing the thread 5 before
starting on the thread 6, and the thread 6 may be executed before the thread 5.
The first CPU 2 may be a type of processor that is able to switch between
several different threads such that the CPU 2 e.g. starts processing the thread 5,
leaves the thread 5 before it is finished to process the thread 6 and then returns
to the thread 5 again to continue where it left off. Such a processor is sometimes
called a Fine Grained Multi-Threading Processor. A Simultaneous
Multi-Threading (SMT) Processor is able to process several threads in parallel, so if the
CPU 2 is such a processor it is able to process the threads 5, 6 simultaneously.
Thus, it is not necessary to have multiple CPUs in order to process multiple
threads concurrently.

Collisions may occur between the threads 5, 6, 7 when the instructions of the
computer program 8 are executed out of program order. As mentioned above, a
collision is a situation in which the threads access the shared memory 4 in such
a way that there is no guarantee that the semantics of the original
single-threaded program 8 is preserved. It is therefore of interest to provide
mechanisms for detecting and handling collisions that may arise during
speculative thread execution.

According to the present invention each thread 5, 6, 7 is associated with a data
structure 9, 10, 11, which is illustrated schematically in Figure 1. The data
structure is used to store information indicative of which memory elements in
the shared memory 4 the respective thread has accessed. According to an
embodiment of the present invention each data structure includes a number of
bits 12 that correspond to the memory elements in the shared memory.
According to the embodiment of the present invention shown in Figure 1 the
bits 12 of each data structure 9, 10, 11 are divided into a load vector 9a, 10a, 11a
and a store vector 9b, 10b, 11b. For each memory element m0, m1, m2, ..., mn in
the shared memory 4, there is exactly one corresponding bit 12 in the load
vector and exactly one corresponding bit 12 in the store vector associated with
each thread. When the thread 5 reads from a memory element, it sets the
corresponding bit 12 in the load vector 9a to indicate that the memory element
has been read. The store vector 9b is updated analogously when the thread 5
writes to the shared memory.
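
As an illustration only, the per-thread load and store vectors described above could be sketched in Python as follows. The class name AccessRecord, its method names and the use of an integer as a bit vector are assumptions made for this sketch and do not appear in the original text; a one-to-one correspondence between memory elements and bits is assumed.

```python
class AccessRecord:
    """Hypothetical per-thread record of accesses to the shared memory."""

    def __init__(self, num_elements):
        self.num_elements = num_elements
        self.load = 0    # bit i set: memory element i has been read
        self.store = 0   # bit i set: memory element i has been written

    def _bit(self, element):
        if not 0 <= element < self.num_elements:
            raise IndexError("no such memory element")
        return 1 << element

    def mark_read(self, element):
        self.load |= self._bit(element)

    def mark_write(self, element):
        self.store |= self._bit(element)


record = AccessRecord(num_elements=8)
record.mark_read(3)    # the thread reads memory element m3
record.mark_write(5)   # the thread writes memory element m5
print(bin(record.load), bin(record.store))
```

In such a sketch only the owning thread would ever call mark_read or mark_write on its record; other threads would merely read the two vectors.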

There can either be a one-to-one correspondence or a many-to-one
correspondence between the memory elements and the bits in the load and store
vectors. By having a many-to-one correspondence, the memory overhead is
reduced at the cost of spurious collisions, which cause slower execution.
Reducing the memory overhead will however also result in reduced execution
overhead, since there will be fewer cache misses. A hash function can be used to
map the number of a memory element to a bit position in the load and store
vectors.

Figure 3A illustrates an example of how the data structures 9, 10, 11 are used
according to the present invention. In this example the thread 5 has written to
the memory elements m1 and m4 and read the memory elements m1, m5 and m8.
The thread 6 has written to the memory elements m2, m6 and m9 and read the
memory elements m2, m6 and m13. The thread 7 has read the memory element
m12. In this example, there are more memory elements in the shared memory
than there are bit positions in the load and store vectors, which means that there
is a many-to-one correspondence between the memory elements and the bits in
the load and store vectors. In this example the bit position in the load and store
vector that corresponds to a selected memory element is found using a hash
function, which in this example simply calculates the remainder when dividing
the number of the memory element by the size of the load and store vectors.
This means that when the thread 5 writes to the memory element m1, it sets the
bit in position number 1 in its store vector and when the thread 6 writes to the
memory element m9, it sets the bit in position number 1 in its store vector.

When the threads have performed the write and read operations mentioned
above, the bit position numbers that are set will be 0, 1, 5 for the load vector 9a;
1, 4 for the store vector 9b; 2, 5, 6 for the load vector 10a; 1, 2, 6 for the store
vector 10b and 4 for the load vector 11a. This is illustrated in Figure 3A by
means of filled boxes representing the bits that are set.
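
The bit positions listed above can be checked with a short sketch. The vector size of 8 is an assumption: it is not stated in the text, but it is the only size for which the remainder hash reproduces the positions given for Figure 3A. The function names are likewise hypothetical.

```python
VECTOR_SIZE = 8  # assumed size of each load and store vector


def bit_position(element_number):
    """Remainder hash: memory element number -> bit position."""
    return element_number % VECTOR_SIZE


def set_positions(element_numbers):
    """Sorted bit positions set after accessing the given elements."""
    return sorted({bit_position(n) for n in element_numbers})


load_9a = set_positions([1, 5, 8])     # thread 5 reads m1, m5, m8
store_9b = set_positions([1, 4])       # thread 5 writes m1, m4
load_10a = set_positions([2, 6, 13])   # thread 6 reads m2, m6, m13
store_10b = set_positions([2, 6, 9])   # thread 6 writes m2, m6, m9
load_11a = set_positions([12])         # thread 7 reads m12

print(load_9a, store_9b, load_10a, store_10b, load_11a)
```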

The implementation of the present invention can be simplified by means of the
data structures 9, 10, 11 each comprising a single combined load and store
vector instead of a separate load vector and a separate store vector. Figure 3B
illustrates the same example as described above with reference to Figure 3A,
with the only difference that the data structures 9, 10, 11 each include a single
combined load and store vector 9c, 10c, 11c instead of the load vectors 9a, 10a,
11a and the store vectors 9b, 10b, 11b. The bit positions that are set in the
combined load and store vector 9c correspond to a logical bitwise inclusive OR
operation of the load vector 9a and the store vector 9b shown in Figure 3A.

The embodiment of the present invention wherein the data structures include a
single combined load and store vector results in an increased number of
spurious collisions, but on the other hand it also results in a reduced need for
memory to store the data structures and a reduced number of operations when
checking for collisions, as will be discussed further below.
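
A minimal sketch of how a combined vector relates to the separate vectors, using the bit positions given for thread 5 in the example above. Representing a bit vector as a Python integer and the helper names to_vector and to_positions are assumptions of this sketch.

```python
def to_vector(positions):
    """Build an integer bit vector with the given bit positions set."""
    vector = 0
    for p in positions:
        vector |= 1 << p
    return vector


def to_positions(vector):
    """List the bit positions that are set in an integer bit vector."""
    return [p for p in range(vector.bit_length()) if (vector >> p) & 1]


load_9a = to_vector([0, 1, 5])
store_9b = to_vector([1, 4])
combined_9c = load_9a | store_9b   # bitwise inclusive OR

print(to_positions(combined_9c))
```

The combined vector no longer distinguishes reads from writes, which is why it needs half the storage but can report collisions that the separate vectors would not.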

The embodiments of the present invention shown in Figures 3A and 3B use a
type of data versioning called privatisation, which means that a private copy 14
of a memory element that is to be modified is created for the thread that
modifies the element. The thread then modifies the private copy instead of the
original memory element in the shared memory. The private copies contain
pointers 15 to their corresponding original memory element in the shared
memory. The private copies are used to write over the original memory elements
in the shared memory 4 when the threads for which they were created are
committed. If a thread is rolled back, its associated private copies 14 are
discarded. Figure 4 shows a flow diagram illustrating how reading from the
shared memory is performed when privatisation is used. Figure 5 shows a
corresponding flow diagram for writing to the shared memory.

Figure 4 shows a first step 20, wherein the memory element to be read is marked
as read in the load vector. In step 21, it is examined whether or not the thread
has a private copy of the memory element to be read. If a private copy exists the
data is read from the private copy, step 22. If there is no private copy the data is
read from the memory element in the shared memory, step 23.

Figure 5 shows a first step 25, wherein it is examined whether or not the thread
has a private copy of the memory element to be written to. If there is no private
copy, the memory element to be written to is marked as written in the store
vector, step 26, and a private copy is created, step 27. The data is then written to
the private copy, step 28. If a private copy is found to exist in step 25, the data
can be written to the private copy directly, step 28, without having to make a
mark in the store vector or create the private copy.
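
The read and write flows of Figures 4 and 5 can be sketched as follows. This is an illustrative sketch only: the class name SpeculativeThread, the use of a Python list as the shared memory, a dictionary keyed by element index standing in for the private copies 14 and their pointers 15, and a one-to-one mapping between elements and bits are all assumptions. Clearing the vectors on rollback is likewise an assumption; the text only states that the private copies are discarded.

```python
class SpeculativeThread:
    """Hypothetical sketch of privatised access to a shared memory."""

    def __init__(self, shared):
        self.shared = shared   # list standing in for the shared memory
        self.load = 0          # load vector, one bit per memory element
        self.store = 0         # store vector, one bit per memory element
        self.private = {}      # element index -> private copy of its data

    def read(self, i):
        self.load |= 1 << i                      # step 20: mark as read
        if i in self.private:                    # step 21: private copy?
            return self.private[i]               # step 22
        return self.shared[i]                    # step 23

    def write(self, i, value):
        if i not in self.private:                # step 25: private copy?
            self.store |= 1 << i                 # step 26: mark as written
            self.private[i] = self.shared[i]     # step 27: create the copy
        self.private[i] = value                  # step 28: write to the copy

    def commit(self):
        for i, value in self.private.items():    # write over the originals
            self.shared[i] = value
        self.private.clear()

    def rollback(self):
        self.private.clear()                     # discard private copies
        self.load = self.store = 0               # assumed: start afresh


memory = [10, 20, 30]
thread = SpeculativeThread(memory)
thread.write(1, 99)
print(thread.read(1), memory[1])   # private value vs. untouched original
```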

The privatisation described above is not a prerequisite of the present invention.
Another type of data versioning, which may be used instead of privatisation,
involves the threads storing backup copies of the memory elements before
they modify them. These backup copies are then copied back to the shared
memory during a rollback.
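
A minimal sketch of this backup-copy alternative, under the same kind of assumptions as before: the class name BackupThread and the list standing in for the shared memory are hypothetical, and the bookkeeping of load and store vectors is left out to keep the example short.

```python
class BackupThread:
    """Hypothetical thread that modifies shared memory in place."""

    def __init__(self, shared):
        self.shared = shared
        self.backups = {}   # element index -> value before first change

    def write(self, i, value):
        if i not in self.backups:
            self.backups[i] = self.shared[i]   # save a backup copy first
        self.shared[i] = value                 # then modify the original

    def rollback(self):
        for i, old_value in self.backups.items():
            self.shared[i] = old_value         # copy the backups back
        self.backups.clear()

    def commit(self):
        self.backups.clear()                   # backups no longer needed


memory = [1, 2, 3]
thread = BackupThread(memory)
thread.write(0, 7)
thread.write(0, 8)
thread.rollback()
print(memory)
```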

The embodiments of the present invention described above comprise data
structures in the form of bit vectors for storing information indicative of the
thread's accesses to the memory. However, many alternative types of data
structures for storing this information are possible according to the present
invention. The data structures may for instance be implemented as lists to which
numbers that correspond to the memory elements are added to indicate accesses
to the memory elements. Other possible implementations of the data structures
include trees, hash tables and other representations of sets.

It will now be discussed how the thread associated data structures of the present
invention can be used to check for and detect collisions.

In a software implementation where the thread associated data structures of the
present invention are used to check for collisions, a thread that has collided with
another thread will itself detect the collision. In the known mechanisms
discussed above an older thread would detect if a younger thread has collided
and send a message about this so that the younger thread would be rolled back.
This sending of messages takes time and causes an extra delay, which can be
avoided by means of the present invention.

According to a preferred embodiment of the present invention collision checks
are performed after the thread has finished its execution and is about to be
committed. The collision check is made by means of comparing the data
structure associated with the thread to be checked with the data structures
associated with other threads on which the thread to be checked may depend. In
order to keep track of the possible dependencies between threads a dependence
list may be created for each thread before it starts executing. This is illustrated in
Figure 6, by means of the threads 5, 6, 7 which are associated with dependence
lists 16, 17 and 18 respectively. The dependence lists are lists of all older threads
that had not yet been committed when the thread was about to start executing.
The thread 7 may depend on threads 5 and 6 so its dependence list 18 contains
references to threads 5 and 6 to indicate the possible dependency.
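
Creating such a dependence list could be sketched as below. The function name and the convention that threads are identified by numbers, with a smaller number meaning an older thread in program order, are assumptions made for illustration.

```python
def make_dependence_list(new_thread, uncommitted_threads):
    """Older threads that are not yet committed when new_thread starts.

    Assumes thread numbers follow program order (smaller means older).
    """
    return [t for t in uncommitted_threads if t < new_thread]


# Threads 5 and 6 are still uncommitted when thread 7 is about to start.
dependence_list_18 = make_dependence_list(7, [5, 6])
print(dependence_list_18)
```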

The dependence list described above is just an example of how to keep track of
possible dependencies between threads. The dependence list is not limited to a
list structure but can also be represented as an alternative structure that can store
information regarding possible dependencies. It is further not necessary for the
dependence list to store a reference to all older not yet committed threads. For
example, in an implementation where forwarding is used it may be possible to
determine that the thread to be started is not dependent on some of the older
not yet committed threads and it is then not necessary to store a reference to
these threads in the dependence list. In other cases the information stored in the
dependence list may refer to an interval of threads of which some already have
been committed when the dependence list is created. As long as the dependence
list includes a reference to all the threads that the thread to be started depends
on, there is no harm in the dependence list also including references to some
threads that the thread to be started clearly does not depend on.

Figure 7 shows a flow diagram of how a thread may be executed and a collision
check for the thread may be made according to the present invention. In a step
30, the dependence list for the thread to be executed is created. The thread is
then executed in a step 31. When the thread has finished executing, it waits until
the threads that it may depend on have been checked for collisions and are ready
to be committed, step 32. It then compares its associated data structure to the
data structures associated with the threads in the dependence list to check for
collisions, step 33. If no collision is detected, the thread is committed in a step
34, otherwise the thread is rolled back in a step 35. If the thread has collided
with another thread, the risk that the thread collides with the same thread again
may be reduced by means of delaying the restart of the thread until the thread it
collided with has been committed. The system may be arranged to give higher
priority to committing threads with which other threads have collided.
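
Steps 33 to 35 of Figure 7 can be sketched as a small function. The sketch assumes that each thread's data structure is a dictionary holding integer load and store bit vectors, that the waiting of step 32 has already completed, and that the comparison used is the one given further on in this description for separate load and store vectors; the function names are hypothetical.

```python
def collides(older, younger):
    """Comparison for separate load and store vectors (integers as bits)."""
    return older["store"] & (younger["store"] | younger["load"]) != 0


def finish_thread(thread, dependence_list):
    """Steps 33 to 35 of Figure 7 for a thread that has finished executing.

    Assumes step 32 is already satisfied: every thread in dependence_list
    has been checked for collisions and is ready to be committed.
    """
    for older in dependence_list:          # step 33: compare data structures
        if collides(older, thread):
            return "rollback"              # step 35
    return "commit"                        # step 34


older = {"load": 0b0000, "store": 0b0100}      # wrote the element at bit 2
reader = {"load": 0b0100, "store": 0b0000}     # read the element at bit 2
bystander = {"load": 0b0001, "store": 0b0000}  # touched a different element

print(finish_thread(reader, [older]), finish_thread(bystander, [older]))
```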

When the collision check is performed as described above, even the oldest not
yet committed thread is speculative, since it might have collided with an earlier
thread that already has been committed and this is not detected until the thread
has finished its execution. However, when a thread has become the oldest not
yet committed thread, it will have to be rolled back at the most once, since when
it is restarted, there is no other thread that it can collide with.

Alternatively one or several partial collision checks may be performed during
execution, before performing the collision check when the thread has finished
executing. The partial collision check can be performed without locking the data
structures associated with other threads because it is acceptable that the partial
check fails to detect some collisions. Collisions that were not detected in the
partial collision check will be detected in the final collision check that is
performed after the thread has finished its execution.

The comparison between two data structures to detect collisions is performed
differently depending on whether the data structures include separate
load and store vectors or a combined load and store vector. If the data
structures have separate load and store vectors the comparison between the
load and store vectors of an older and a younger thread can be carried out by
means of performing the following logical operations bitwise on the bit vectors:

old store vector AND (young store vector OR young load vector).

If the resulting vector contains any bits that are set there is a collision and the
younger thread should be rolled back. If the data structures have combined load
and store vectors the corresponding logical operation to be performed to check
for collisions is an AND-operation between the combined vector of the older
thread and the combined vector of the younger thread.
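
Both comparisons can be tried on the bit positions given for threads 5 and 6 in the Figure 3A example. The sketch below represents bit vectors as Python integers with bit 0 as the rightmost digit; the function names are assumptions.

```python
def check_separate(old_store, young_load, young_store):
    """Colliding bits when load and store vectors are kept separate."""
    return old_store & (young_store | young_load)


def check_combined(old_combined, young_combined):
    """Colliding bits when each thread has one combined vector."""
    return old_combined & young_combined


# Thread 5 (older): load bits 0, 1, 5 and store bits 1, 4.
load_5, store_5 = 0b00100011, 0b00010010
# Thread 6 (younger): load bits 2, 5, 6 and store bits 1, 2, 6.
load_6, store_6 = 0b01100100, 0b01000110

separate = check_separate(store_5, load_6, store_6)
combined = check_combined(load_5 | store_5, load_6 | store_6)
print(bin(separate), bin(combined))
```

Both checks flag bit 1, because m1 (written by thread 5) and m9 (written by thread 6) share a bit position; this is a spurious collision caused by the many-to-one mapping. The combined check additionally flags bit 5, where the two threads only read different elements (m5 and m13).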

In an alternative embodiment the comparison to detect collisions is carried out
by means of performing the following logical operation bitwise on the bit
vectors:

old store vector AND young load vector.

This comparison assumes that the threads are committed in program order and
that when a write operation that only modifies part of a memory element (which
corresponds to a read-modify-write operation) is carried out, the corresponding
bit in both the load and the store vector is set.
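
The alternative comparison can be illustrated with the bit positions from the Figure 3A example. As before, integers stand in for bit vectors and the function name is hypothetical.

```python
def check_alternative(old_store, young_load):
    """Colliding bits under the alternative comparison."""
    return old_store & young_load


store_5 = 0b00010010   # thread 5 (older): store bits 1, 4
load_6 = 0b01100100    # thread 6 (younger): load bits 2, 5, 6
load_7 = 0b00010000    # thread 7 (younger): load bit 4

print(bin(check_alternative(store_5, load_6)),
      bin(check_alternative(store_5, load_7)))
```

With these values the thread 6 passes the check against the thread 5, since overlapping stores alone are no longer treated as a collision, whereas the thread 7 is flagged at bit 4, the position shared by m4 (written by thread 5) and m12 (read by thread 7).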

An advantage of the collision check of the present invention is that since
collisions do not have to be detected until the thread has finished executing,
there is no need for any locking mechanism or memory barriers during
execution. This reduces the execution overhead and makes the implementation
simpler. Another reason why the execution overhead can be reduced according
to the present invention is that if the collision check is only performed when the
thread has finished executing, at most one check will have to be made for each
accessed memory element, even if the element has been accessed many times
during execution. In the known mechanisms discussed above a collision check
was performed in connection with each access to the shared memory.

The cost of handling collisions according to the present invention is that
collisions are not detected as early as possible, which results in some wasted data
processing of threads that already have collided and should be rolled back.
However, the gain in execution overhead will in many cases surpass the cost of
not detecting collisions immediately. The collision check of the present
invention described above is thus particularly favorable when collisions are rare.

According to the present invention, the only thing that has to be performed in
the same order as in the original single-threaded program is the collision check.
Threads can be executed and rolled back out of program order and, depending
on the implementation, sometimes also committed out of program order.

If the many-to-one correspondence between the memory elements and the bits
in the load and store vectors is used, the load and store vectors can have a fixed
size. The memory overhead is then proportional to the number of threads
instead of the number of memory elements, which means that the amount of
memory needed to store the data structures will remain the same when the
number of memory elements in the shared memory increases.

The present invention can be implemented both in hardware and in software. In
a hardware implementation it is possible to use a fast fixed-size memory inside
each processor to store the data structures. In a software implementation a
speed advantage will be obtained if the data structures are made small enough to
be stored in the first level cache of the processor. Due to the frequent use of the
data structures it will be advantageous to store them in as fast memory as
possible.

The data structure associated with a thread will naturally only have to be stored
in memory until the thread with which it is associated and all threads that may
depend on the thread are committed. Once the thread and all threads that may 
depend on it are committed the memory used to store its associated data 
structure can be reused. 
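
The reuse rule can be sketched as follows in Python. The dictionary of records, the `committed` set and the function name `release_records` are hypothetical bookkeeping invented for the example.

```python
def release_records(records, dependents, committed):
    """Drop the record of every thread that no longer needs it.

    records:    thread id -> access record (any object)
    dependents: thread id -> ids of threads that may depend on that thread
    committed:  set of ids of committed threads
    Returns the ids whose storage can be reused.
    """
    freed = []
    for tid in list(records):
        needed_by = dependents.get(tid, ())
        # A record is kept until its own thread and every thread that may
        # depend on it have been committed.
        if tid in committed and all(d in committed for d in needed_by):
            del records[tid]
            freed.append(tid)
    return freed


records = {0: "record 0", 1: "record 1", 2: "record 2"}
dependents = {0: [1, 2], 1: [2], 2: []}

freed_early = release_records(records, dependents, committed={0, 1})
freed_late = release_records(records, dependents, committed={0, 1, 2})
```

In the first call nothing is released, because thread 2 may still depend on threads 0 and 1; after thread 2 is committed all three records become reusable.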

The present invention is not limited to any particular type of memory elements 
of a shared memory. The present invention is applicable to both logical and 
physical memory elements. Logical memory elements are for example variables, 
vectors, structures and objects in an object oriented language. Physical memory 
elements are for example bytes, words, cache lines, memory pages and memory 
segments. 

As described above, a thread comprises a number of program instructions. Other
terms for a series of instructions are sometimes used in the field. An
example of such a term is job.

Thread-level speculative execution with a shared memory has many similarities 
to a database transaction system. The entries of a database can be compared 
with the elements of a shared memory and since a database transaction includes
a number of operations, a database transaction can be compared with a thread. 
One way to ensure that a database remains consistent is to check for collisions 
between different database transactions. Thus the principles of the ideas of the 
present invention may be used also in this field. 

It is to be understood that the embodiments of the present invention discussed
above and illustrated in the figures merely serve as examples to illustrate the
ideas of the present invention and that the invention is in no way limited to just
the examples described. The examples are, for instance, simple examples that
only illustrate a few memory elements in the shared memory and a few bits in
the data structures associated with the threads. In reality the number of memory
elements and bits can be very large. The present invention is further not limited
to any particular number of threads or CPUs.
CLAIMS

1. An apparatus that supports execution of computer program instructions 
speculatively out of program order comprising: 

- a plurality of threads for executing computer program instructions, and 
- a shared memory, which comprises a number of memory elements accessible
to the plurality of threads;
wherein each of the threads is associated with a data structure for storing
information regarding accesses to the memory elements of the shared memory
and wherein each of the threads has means for accessing a selected memory
element in the shared memory and means for storing information in the
associated data structure indicative of the access to the selected memory
element. 

2. The apparatus according to claim 1, wherein the data structures are one of the
following types of structures: an unsorted list, a sorted list, a tree and a table.

3. The apparatus according to claim 1, wherein each data structure comprises a 
number of bits that correspond to the memory elements of the shared memory
and wherein the means for storing information are means for setting at least one
chosen bit, which at least one chosen bit corresponds to the selected memory
element. 

4. The apparatus according to claim 3, wherein the data structure comprises a
load vector and a store vector, wherein the means for setting at least one chosen
bit is arranged to set a bit in the load vector when the first thread accesses the
selected memory element in order to read it, and wherein the means for setting
at least one chosen bit is arranged to set a bit in the store vector when the first
thread accesses the selected memory element in order to write to it.
5. The apparatus according to claim 3, wherein the data structure comprises a
single combined load and store vector. 

6. The apparatus according to claim 4 or 5, wherein there is a one-to-one 
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

7. The apparatus according to claim 4 or 5, wherein there is a many-to-one 
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

8. The apparatus according to claim 7, wherein the correspondence between the
bits in the or each vector and the memory elements is determined by a hash
function that maps the memory elements to the bits in the or each vector.

9. The apparatus according to any of claims 1-8, wherein the apparatus further 
comprises means for checking whether a thread has a private copy of the
selected memory object, means for creating a private copy of the selected
memory object and means for reading and writing to a private copy of the
selected memory object.

10. The apparatus according to any of claims 1-8, wherein the apparatus further 
comprises means for storing a backup copy of the selected memory element.

11. The apparatus according to any of claims 1-10, wherein the apparatus
further comprises means for checking, when a first thread has finished
execution, if each of the threads on which the first thread may depend is ready
to be committed and means for checking for collision between the first thread
and each of the threads on which the first thread may depend, which means for
checking comprises means for comparing the data structure associated with the
first thread with each respective data structure associated with the threads on
which the first thread may depend.

12. The apparatus according to claim 11, wherein the apparatus further
comprises means for creating a dependence list associated with the first thread
before execution of the first thread, which dependence list includes a reference
to each thread which has not yet been committed and which comes before the
first thread in program order.

13. The apparatus according to claim 11 or 12, wherein the apparatus further
comprises means for committing the first thread if no collision is detected
between the first thread and any of the threads on which the first thread may
depend and means for restarting execution of the first thread if a collision is
detected between the first thread and any of the threads on which the first
thread may depend.

14. The apparatus according to claim 13, wherein the apparatus further
comprises means for delaying a restart of execution of the first thread until the
thread or each of the threads with which the first thread has collided has been
committed.

15. The apparatus according to claim 14, wherein the apparatus further
comprises means for giving priority to committing and/or executing the thread
or each of the threads with which the first thread has collided.

16. The apparatus according to any of claims 11-15, wherein the apparatus
further comprises means for performing a partial check for collisions between
the first thread and at least one of the threads on which the first thread may
depend, which means for performing a partial check comprises means for
comparing the data structure associated with the first thread with the respective
data structure associated with the at least one of the threads on which the first
thread may depend.

17. A method for recording information regarding accesses to a shared 
memory, which shared memory is accessible to a plurality of threads that are 
arranged to execute computer program instructions speculatively out of program 
order, which method includes the steps of: 
- a first of the plurality of threads accessing a selected memory element in the
shared memory, and
- the first thread storing information indicative of the access to the selected
memory element in a data structure associated with the first thread.

18. The method according to claim 17, wherein the data structure is one of the
following types of structures: an unsorted list, a sorted list, a tree and a table. 

19. The method according to claim 17, wherein each data structure comprises a 
number of bits that correspond to the memory elements of the shared memory
and wherein the step of storing information comprises setting a chosen bit in the
data structure, which chosen bit corresponds to the selected memory element.

20. The method according to claim 19, wherein the data structure comprises a 
load vector and a store vector, wherein the chosen bit is a bit in the load vector
if the first thread accesses the selected memory element in order to read it, and
wherein the chosen bit is a bit in the store vector if the first thread accesses the
selected memory element in order to write to it.
21. The method according to claim 19, wherein the data structure comprises a
single combined load and store vector. 

22. The method according to claim 20 or 21, wherein there is a one-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

23. The method according to claim 20 or 21, wherein there is a many-to-one
correspondence between the memory elements in the shared memory and the
bits in the or each vector of the data structure.

24. The method according to claim 23, wherein the correspondence between the
bits in the or each vector and the memory elements is determined by means of
mapping the memory elements to the bits in the or each vector using a hash
function.



25. The method according to any of claims 17-24, comprising the further steps
of:

- the first thread checking whether it has a private copy of the selected memory
object;

- if the first thread has a private copy and the first thread accesses the selected
memory element in order to read it, the first thread reading from the private
copy;

- if the first thread does not have a private copy and the first thread accesses
the selected memory element in order to read it, the first thread reading from
the selected memory element in the shared memory;

- if the first thread has a private copy and the first thread accesses the selected
memory element in order to write to it, the first thread writing to the private
copy; and

- if the first thread does not have a private copy and the first thread accesses
the selected memory element in order to write to it, the first thread creating a
private copy of the selected memory element and writing to the private copy.

26. The method according to any of claims 17-24, comprising the further steps
of, if the first thread accesses the selected memory element in order to write to
it, the first thread storing a backup copy of the selected memory element and the
first thread writing to the selected memory element in the shared memory after
the backup copy is stored.



27. A method for handling possible collisions between a plurality of threads,
which threads are arranged to execute computer program instructions
speculatively out of program order and to access memory elements of a shared
memory, which method includes the steps of:

executing a first thread;

checking, when the first thread has finished execution, if each of the threads on
which the first thread may depend is ready to be committed;
waiting until each of the threads on which the first thread may depend is ready
to be committed, if each of the threads on which the first thread may depend is
not ready to be committed; and

checking for collision between the first thread and each of the threads on which
the first thread may depend by means of comparing a data structure associated
with the first thread with a data structure associated with the thread on which
the first thread may depend, which data structures store information regarding
which of the memory elements the thread with which the data structure is
associated has accessed during execution of the thread.

28. The method according to claim 27, wherein each data structure comprises a
number of bits that correspond to the memory elements of the shared memory
and wherein a bit is set if the memory element to which the bit corresponds has
been accessed by the thread with which the data structure is associated during
execution of the thread.

29. The method according to claim 28, wherein each data structure comprises a
load vector and a store vector, wherein a bit in the load vector is set if the
memory object to which the bit corresponds has been read by the thread with
which the data structure is associated during execution of the thread and
wherein a bit in the store vector is set if the memory object to which the bit
corresponds has been written to by the thread with which the data structure is
associated during execution of the thread.

30. The method according to claim 28, wherein each data structure comprises a 
single combined load and store vector. 

31. The method according to any of claims 27-30, wherein the method further
comprises the step of creating a dependence list associated with the first thread
before execution of the first thread, which dependence list includes a reference
to each thread which has not yet been committed and which comes before the
first thread in program order.

32. The method according to any of claims 27-31, wherein the first thread is
committed if no collision is detected and wherein the execution of the first
thread is restarted if a collision is detected.

33. The method according to claim 32, wherein the restart of execution of the
first thread is delayed until the thread or each of the threads with which the first
thread collided has been committed.
34. The method according to claim 33, wherein priority is given to committing
and/or executing the thread or each of the threads with which the first thread 
collided. 

35. The method according to any of claims 27-34, comprising the further step of
performing a partial check for collisions between the first thread and at least one
of the threads on which the first thread may depend by means of comparing the
data structure associated with the first thread with the respective data structure
associated with the at least one of the threads on which the first thread may
depend, wherein no locking of the data structures takes place while the partial
check is performed.

36. A computer program product comprising computer code means for 
performing the method of any of claims 17-26 when run on a computer. 

37. A computer program product comprising computer code means for 
performing the method of any of claims 27-35 when run on a computer.